
Auditbeat: Fixes for system/socket dataset #19033

Merged
merged 7 commits into elastic:master from adriansr:socket_diag on Jun 9, 2020

Conversation

adriansr
Contributor

@adriansr adriansr commented Jun 8, 2020

What does this PR do?

Fixes two problems with the system/socket dataset:

  • A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed on kernel 2.6 / CentOS/RHEL 6.x).

  • Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.

Also fixes two other minor issues:

  • A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicate flows being indexed.
  • Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high-load situations (see the sketch below).
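
For the last point, here is a minimal sketch of the "separate goroutines" idea; names such as `state`, `logStatus` and `expireSockets` are hypothetical and this is not the actual Beats code. The point is only that the status log runs on its own ticker, so a slow cleanup pass under heavy load cannot starve it:

```go
package main

import (
	"log"
	"time"
)

type state struct{} // stand-in for the dataset's internal state

func (s *state) logStatus()     { log.Println("status: flow/socket counters ...") }
func (s *state) expireSockets() { /* walk the LRU and close idle sockets */ }

// run starts the status logger and the socket reaper in separate goroutines,
// so neither can delay the other.
func run(s *state, done <-chan struct{}) {
	go func() { // status logging loop
		t := time.NewTicker(5 * time.Second)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				s.logStatus()
			case <-done:
				return
			}
		}
	}()
	go func() { // cleanup loop, independent of the logger
		t := time.NewTicker(10 * time.Second)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				s.expireSockets()
			case <-done:
				return
			}
		}
	}()
}

func main() {
	done := make(chan struct{})
	run(&state{}, done)
	time.Sleep(30 * time.Second)
	close(done)
}
```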

Why is it important?

It has been observed that the dataset would use 100% CPU and stop reporting events. During testing it was discovered that socket expiration, a new feature to prevent excessive memory usage, wasn't working as expected.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

The infinite loop is easy to trigger on RHEL 6.x by running:

nmap -n -sT 127.0.0.1 -p 1-65535

@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Jun 8, 2020
@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Jun 8, 2020
@elasticmachine
Collaborator

elasticmachine commented Jun 8, 2020

💚 Build Succeeded


Build stats

  • Build Cause: [Pull request #19033 event]

  • Start Time: 2020-06-09T13:43:35.683+0000

  • Duration: 42 min 36 sec

Test stats 🧪

  • Failed: 0
  • Passed: 736
  • Skipped: 176
  • Total: 912

@adriansr adriansr changed the title from Socket diag to Auditbeat: Fix infinite loop in system/socket dataset Jun 8, 2020
@adriansr adriansr marked this pull request as ready for review June 8, 2020 19:27
@adriansr adriansr requested a review from a team as a code owner June 8, 2020 19:27
@elasticmachine
Collaborator

Pinging @elastic/siem (Team:SIEM)

@adriansr adriansr added the review label Jun 8, 2020

@andrewstucki andrewstucki left a comment


Just adding a bit deeper explanation, for posterity's sake.

So, it looks like this happens when you get kernel pointer re-use of the sockets after a missed inet_release syscall. Our old clean-up code failed to remove the socket from the socketLRU but overwrote the map lookup value in the underlying sockets map. When the reaper code for cleaning up old sockets then ran, the orphaned record in the socketLRU would be referenced along with its kernel-based pointer, which now pointed to a new socket. As a result, the reference in the socketLRU would never get removed and would be evaluated again and again via the Peek call in the for loop. The fix works because any time we expire a socket we now also explicitly remove it from the socketLRU (which a socket reference is always added to when the socket is created) and mark it as closing.

Does that sound about right? If so, would it be possible to add a simple test for this condition? Basically flow --> missed inet_release --> reused kernel pointer uint64 --> make sure we have the old socket reference in a closing state?
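
To make that failure mode concrete, here is a heavily simplified, hypothetical sketch (not the actual state.go; the toy `lru` type and the field names stand in for the real socketLRU and sockets map mentioned above). If expiration only touches the sockets map and leaves the stale reference in the LRU, Peek keeps returning the same orphaned entry and the reaper loop spins forever; removing it from the LRU and marking it closing lets the loop make progress:

```go
package main

import (
	"fmt"
	"time"
)

type socket struct {
	ptr      uintptr // kernel pointer; may be reused after a missed inet_release
	lastSeen time.Time
	closing  bool
}

// lru is a toy FIFO standing in for the real socketLRU.
type lru struct{ items []*socket }

func (l *lru) Add(s *socket) { l.items = append(l.items, s) }

func (l *lru) Peek() *socket {
	if len(l.items) == 0 {
		return nil
	}
	return l.items[0]
}

func (l *lru) Remove(s *socket) {
	for i, it := range l.items {
		if it == s {
			l.items = append(l.items[:i], l.items[i+1:]...)
			return
		}
	}
}

type state struct {
	sockets   map[uintptr]*socket
	socketLRU lru
}

// expireOlderThan reaps idle sockets. With the old bug, the entry was only
// deleted (or overwritten on kernel pointer reuse) in s.sockets and never
// removed from the LRU, so the next Peek returned the same stale entry again
// and this loop never terminated.
func (s *state) expireOlderThan(deadline time.Time) {
	for {
		oldest := s.socketLRU.Peek()
		if oldest == nil || oldest.lastSeen.After(deadline) {
			return
		}
		// Fix: always drop the entry from the LRU as well and mark it closing.
		s.socketLRU.Remove(oldest)
		delete(s.sockets, oldest.ptr)
		oldest.closing = true
	}
}

func main() {
	s := &state{sockets: map[uintptr]*socket{}}
	old := &socket{ptr: 0x1234, lastSeen: time.Now().Add(-time.Hour)}
	s.sockets[old.ptr] = old
	s.socketLRU.Add(old)

	s.expireOlderThan(time.Now().Add(-30 * time.Minute)) // terminates
	fmt.Println("closing:", old.closing, "sockets left:", len(s.sockets))
}
```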

The feature was using socket.closeTime as a reference for expiration,
but this timestamp was only set once the socket was closed or expired,
so it caused all sockets to expire every closeTimeout.
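
A minimal sketch of what that commit message describes, using hypothetical names rather than the actual Beats code: for live sockets closeTime was still the zero time.Time, so the idle check compared against a timestamp far in the past and every socket looked expired; comparing against a timestamp that is initialized when the socket is created fixes the check:

```go
package main

import (
	"fmt"
	"time"
)

type sock struct {
	lastSeen  time.Time // set when the socket is created, updated on activity
	closeTime time.Time // only set once the socket is closed or expired
}

// buggyExpired mirrors the old behavior: closeTime is the zero value for
// live sockets, so every live socket looks older than any timeout.
func buggyExpired(s *sock, timeout time.Duration, now time.Time) bool {
	return now.Sub(s.closeTime) > timeout
}

// fixedExpired compares against a timestamp that is always initialized.
func fixedExpired(s *sock, timeout time.Duration, now time.Time) bool {
	return now.Sub(s.lastSeen) > timeout
}

func main() {
	s := &sock{lastSeen: time.Now()}
	now := time.Now()
	fmt.Println(buggyExpired(s, 5*time.Minute, now)) // true: would expire immediately
	fmt.Println(fixedExpired(s, 5*time.Minute, now)) // false: still fresh
}
```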
@adriansr
Contributor Author

adriansr commented Jun 9, 2020

@andrewstucki while adding the test I found yet another problem.

It wasn't dealing with socket timeouts properly, see the new commit.

Can you have another look?

@adriansr adriansr changed the title from Auditbeat: Fix infinite loop in system/socket dataset to Auditbeat: Fixes for system/socket dataset Jun 9, 2020

@andrewstucki andrewstucki left a comment


painful 😬 thanks for tracking these down and adding the tests @adriansr

@adriansr
Contributor Author

adriansr commented Jun 9, 2020

yep, the whole state.go should be rewritten from scratch :D

@adriansr adriansr merged commit 665b67f into elastic:master Jun 9, 2020
@adriansr adriansr deleted the socket_diag branch June 9, 2020 16:05
adriansr added a commit to adriansr/beats that referenced this pull request Jun 9, 2020
Fixes two problems with the system/socket dataset:

- A bug in the internal state of the socket dataset that led to an infinite
  loop on systems where the kernel aggressively reuses sockets (observed
  on kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an
  uninitialized timestamp: flows were expiring at every check.

Also fixes two other minor issues:

- A flow could be terminated twice by different code paths, leading to a wrong
  numFlows calculation and duplicate flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines
  so that logging is still performed under high-load situations.

(cherry picked from commit 665b67f)
@adriansr adriansr added the v7.9.0 label Jun 9, 2020
adriansr added a commit to adriansr/beats that referenced this pull request Jun 9, 2020
@adriansr adriansr added the v7.8.0 label Jun 9, 2020
adriansr added a commit to adriansr/beats that referenced this pull request Jun 9, 2020
@adriansr adriansr added the v7.7.2 label Jun 9, 2020
adriansr added a commit that referenced this pull request Jun 9, 2020
adriansr added a commit that referenced this pull request Jun 9, 2020
adriansr added a commit that referenced this pull request Jun 9, 2020
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
@ssfxx

ssfxx commented Mar 16, 2023

@adriansr auditbeat version 7.17.0 (amd64), libbeat 7.17.0: with the system/socket dataset, peak CPU usage exceeds 100% and mean CPU usage exceeds 40%.
The server handles 3000+ events per minute.
Please help!
Thanks!

leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…9081)

(cherry picked from commit 9555ff4)